Connection tracking records for a very large scale NAT engine

ABSTRACT

Some embodiments provide a novel method for performing network address translation to share a limited number of external source network addresses among a large number of connections. Instead of allocating an external source network address for an egressing packet just based on its internal source network address, the method of some embodiments allocates the external source network address based on the egressing packet's source network address and destination network address. This allows a limited number of external source network addresses to be re-used for different destination network addresses. For instance, in some embodiments, the method's network address allocation scheme allows the same 64K (e.g., 2^16) external source ports to be used for 64K connections for each destination network address.

BACKGROUND

Datacenters and other private or public networks with internal addressing schemes often have large numbers of machines and processes on those machines that require connections to the Internet or other outside networks. Communications on the Internet often take the form of packets with source addresses and destination addresses that each include an IP address (often an IPv4 address) and a port number. There are a limited number of available IPv4 addresses (external source IP addresses), so a datacenter will often have more machines that need connections than IPv4 addresses assigned to the datacenter. This problem is partially addressed by network address translation (NAT) systems which assign each outgoing flow of packets, from a machine of the datacenter, a single port of one of the limited number of IP addresses of the datacenter. However, as each IP address only has 64K available ports, a datacenter with a large number of machines, or a large number of processes running on those machines that require outside connections, can exhaust all available ports on the datacenter's external source IP addresses. Therefore, there is a need in the art for a NAT system that provides external source IP addresses and ports for a very large number of connections between machines of the datacenter and machines external to the datacenter.

BRIEF SUMMARY

Some embodiments provide a novel method for performing network address translation to share a limited number of external source network addresses among a large number of connections. This method is implemented in some embodiments by a first network (e.g., a private or public datacenter network) that uses the limited number of external source network addresses to communicate with one or more other networks. In some embodiments, a gateway of the first network performs the operations of this method.

Instead of allocating an external source network address for an egressing packet just based on its internal source network address, the method of some embodiments allocates the external source network address based on the egressing packet's source network address and destination network address. This allows a limited number of external source network addresses to be re-used for different destination network addresses. For instance, in some embodiments, the method's network address allocation scheme allows the same 64K (e.g., 2^16) external source ports to be used for 64K connections for each destination network address.

To keep track of the allocated external source network addresses, the method of some embodiments creates connection-tracking records that map allocated external source network addresses to internal source network addresses and external destination network addresses. The method of some embodiments creates a connection-tracking record for an egressing packet flow when it receives the first packet of the flow. The packet includes a header with an internal source address (e.g., an address of a machine or device in the datacenter), and an external destination address (e.g., an address of a machine or device outside of the datacenter).

From a pool of a limited number of external source network addresses, the method allocates an external source network address for the packet flow, replaces the internal source address of the first packet with the allocated external source network address, and forwards the first packet to its destination. The method also creates a connection-tracking record (1) for subsequent packets of the first flow, and (2) for a second packet flow that is received in response to the first packet flow. In some embodiments, this connection-tracking record includes two sub-records: a first sub-record for the forward direction (i.e., for the subsequent packets of the first flow) and a second sub-record for the reverse direction (i.e., for the packets of the second packet flow that is in response to the first packet flow).

The first sub-record maps the combination of the internal source network address and external destination address to the allocated external source network address, in order to translate the source network address of packets in the first flow to the allocated external source network address. The second sub-record maps the combination of the external destination address and the external source network address to the internal source network address, in order to translate the destination addresses of packets in the second flow to the internal source network address. Instead of creating two sub-records for the forward and reverse directions, the method of some embodiments creates only one mapping connection-tracking record, and then uses different portions of this record for the match and action attributes in the forward and reverse directions.

In some embodiments, the connection-tracking record uses both the external destination address and external source network address for mapping to the internal network address because the same external source network address is also used for different packet flows to different destination addresses. For instance, as mentioned above, the method of some embodiments uses the same external source IP address along with the same 64K source port range for multiple different destination IP addresses. The method of some embodiments further extends the sharing of the 64K source ports by using this port range not only for different destination IP addresses, but also for different destination port addresses to the same destination IP address. This allows the method to support up to 4,294,967,296 (64K multiplied by 64K) connections for each destination IP address.

The method of some embodiments stores the connection-tracking records in a connection-tracking storage. The method creates a connection-tracking record by generating, from a set of header values in the packet header of a packet flow, a hash value that identifies a location in the connection-tracking storage. The method then stores the connection-tracking record at the location in the storage identified by the hash value. Each hash-addressable location in the storage may store a linked list of zero or more connection-tracking records (e.g., may store a pointer to such a list or be a location of an entry of such a list). Each connection-tracking record is associated with a different packet flow to the same destination IP address, and in some embodiments to the same destination port address.
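By way of illustration only, the hash-addressed storage described above can be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of any embodiment: the names FlowKey, FlowRecord and FlowStore, the choice of header values in the key, the bucket count, and the use of Python's built-in hash() are all hypothetical, and a Python list stands in for the linked list kept at each hash-addressable location.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical key: header values identifying a flow
# (ext. source IP, ext. source port, ext. destination IP, ext. destination port).
FlowKey = Tuple[str, int, str, int]


@dataclass
class FlowRecord:
    key: FlowKey
    internal_src: Tuple[str, int]  # (internal source IP, internal source port)


class FlowStore:
    """Hash-addressable storage; each location holds a chain of records."""

    def __init__(self, num_buckets: int = 1024) -> None:
        # A Python list stands in for the linked list at each location.
        self.buckets: List[List[FlowRecord]] = [[] for _ in range(num_buckets)]

    def _bucket(self, key: FlowKey) -> List[FlowRecord]:
        # The hash value identifies a location in the storage.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, record: FlowRecord) -> None:
        self._bucket(record.key).append(record)

    def lookup(self, key: FlowKey) -> Optional[FlowRecord]:
        # Walk the chain, since several flows may hash to the same location.
        for record in self._bucket(key):
            if record.key == key:
                return record
        return None
```

With a single bucket, every record lands in the same chain, so lookup() must compare full keys to tell colliding flows apart.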

The method of some embodiments creates different connection-tracking data stores for each different destination IP address, and in some embodiments for each different combination of destination IP and port addresses. In such embodiments, the method stores the connection-tracking record for a flow in the connection-tracking data store that is defined for the flow's destination IP address, or destination IP/port address pair. For subsequent packets in the same flow or reverse flow, the method in some embodiments first has to identify the connection-tracking data store that is associated with the flow, before identifying the connection-tracking record for the flow.

Some embodiments provide an efficient method of allocating external source port addresses for the multiple connections that share the limited set of external source IP addresses for a destination IP address, or destination IP/port address pair, outside of a network. For instance, in some embodiments, the method specifies multiple pre-allocated port groups, each with multiple external port addresses. Each of these port addresses corresponds to the same external source IP address. When external port addresses are available in the pre-allocated port groups, the method allocates external port addresses from the pre-allocated port groups for new connections to the destination IP address. The method also dynamically modifies the number of pre-allocated port groups as the number of connections to destinations outside of the network increases or decreases. Each pre-allocated group may include several source port addresses and these source port addresses may be contiguous in a particular group. In some embodiments, the method performs these operations in order to provide a fast and efficient mechanism for (i) tracking source port addresses assigned to connections to the destination IP address, and (ii) allocating new source port addresses when no previously pre-allocated source port addresses are available.

As the number of connections to a particular destination address rises, the pre-allocated port groups may assign all of their available ports. In order to assign new ports, the method dynamically modifies the number of pre-allocated port groups by identifying a new connection for which an external source port has to be assigned, determining that the existing set of pre-allocated port groups does not have an external source port address available to assign to the new connection, and specifying a new set of pre-allocated port groups. The method then allocates an external port address from the new set of pre-allocated port groups.

In some embodiments, dynamically modifying the number of groups may include reducing the number of groups. The method may reduce the number of groups by identifying a pre-allocated group that (i) was previously used to assign source port addresses to connections and (ii) has had all of its pre-allocated source ports unassigned for at least some threshold period. The method then removes the identified pre-allocated group from the pre-allocated port groups. Removing the identified pre-allocated group may mean deleting the group, setting the group into an idle state, or otherwise eliminating it from use, at least temporarily.

The method of some embodiments provides an efficient way to search the multiple pre-allocated port groups. In some such methods, each particular pre-allocated group includes a set of metadata. The metadata includes an indicator of a number of ports available in the group for allocating to packet flows and a next port available for allocating to a packet flow. The method modifies the number of available ports and the next-available port in the metadata set of the particular pre-allocated group as ports from the particular group are allocated to packet flows and as ports from the particular group are de-allocated from packet flows that have been terminated.

The method may also determine whether a pre-allocated group has an available port for allocation to a new packet flow by examining the number of available ports in the metadata set of the group. When the number of available ports is zero for a group, the method selects another pre-allocated group from which a port should be selected for the new packet flow. The method of some embodiments also determines whether the set of pre-allocated ports indicates any available ports at all by iteratively examining metadata for each pre-allocated port group. If the metadata for a particular pre-allocated port group indicates an available port, the method identifies that the port is available. However, if the metadata for each pre-allocated port group indicates no ports are available in that pre-allocated port group, the method determines that no port tracked by the present pre-allocated port groups is available. In some embodiments, determining from the metadata that no port is available in the present pre-allocated groups of ports causes the method to dynamically increase the number of pre-allocated port groups.
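The iterative metadata search described above can be illustrated with the following minimal sketch. The names GroupMeta and find_group_with_free_port, and the field layout, are hypothetical and are chosen only for illustration; the sketch shows the search over metadata, not the bookkeeping that updates the metadata as ports are allocated and de-allocated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroupMeta:
    first_port: int            # lowest port covered by the group
    available: int             # number of ports still free in the group
    next_port: Optional[int]   # next port to hand out, or None if the group is full


def find_group_with_free_port(groups: List[GroupMeta]) -> Optional[GroupMeta]:
    """Examine each group's metadata in turn; return the first group with a free port.

    Returning None means no pre-allocated group has a port available, which in
    some embodiments would trigger pre-allocation of additional groups.
    """
    for meta in groups:
        if meta.available > 0:
            return meta
    return None


# Example values: one full 512-port group and one group whose ports 600-1023 are free.
groups = [
    GroupMeta(first_port=0, available=0, next_port=None),
    GroupMeta(first_port=512, available=424, next_port=600),
]
chosen = find_group_with_free_port(groups)
```

Because only the per-group counters are examined, the search does not need to inspect individual ports in groups that are already full.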

The method of some embodiments defines different sets of pre-allocated port groups, with each set associated with a different external destination IP address. The method identifies, for a new packet flow, the port-group set associated with an external destination IP address stored in a header field of the new flow. The method then allocates, for the new packet flow, an external port address from a particular pre-allocated port group in the identified port-group set. The method may also define, for each external destination IP address, a connection-tracking data store for storing connection-tracking records that map allocated external source port addresses to internal source IP and port addresses within the network. The connection-tracking records are used in performing network address translation on packets of flows exiting or entering the network.

The method of some embodiments allocates a bitmap of available source ports and allocates contiguous blocks of source ports in the bitmap to different pre-allocated groups of ports. The method uses the bitmap to identify the pre-allocated port groups and adjust the number of pre-allocated port groups.

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description, the Drawings and the Claims is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description and the Drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purpose of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 conceptually illustrates a process of some embodiments that creates a connection-tracking record for a packet flow.

FIG. 2 illustrates a network address translation (NAT) system with a connection tracker.

FIG. 3 illustrates a network address translation system with a connection tracker that includes external destination ports in its attributes.

FIG. 4 conceptually illustrates a source-port allocation process of some embodiments.

FIG. 5 illustrates pre-allocation of groups of source port addresses.

FIG. 6 conceptually illustrates a process of some embodiments to identify and allocate available ports using metadata of pre-allocated port groups.

FIG. 7 illustrates iteratively searching metadata of pre-allocated port groups.

FIG. 8 conceptually illustrates a process of some embodiments for de-allocating ports.

FIG. 9 illustrates modifications of pre-allocated port groups and associated metadata upon the de-allocation of ports.

FIG. 10 illustrates the use of a hash function to locate tracking records.

FIG. 11 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a novel method for performing network address translation to share a limited number of external source network addresses among a large number of connections. This method is implemented in some embodiments by a first network (e.g., a private or public datacenter network) that uses the limited number of external source network addresses to communicate with one or more other networks. In some embodiments, a gateway of the first network performs the operations of this method.

Instead of allocating an external source network address for an egressing packet just based on its internal source network address, the method of some embodiments allocates the external source network address based on the egressing packet's source network address and destination network address. This allows a limited number of external source network addresses to be re-used for different destination network addresses. For instance, in some embodiments, the method's network address allocation scheme allows the same 64K external source ports to be used for 64K connections for each destination network address.

To keep track of the allocated external source network addresses, some embodiments create connection-tracking records that map allocated external source network addresses to internal source network addresses and external destination network addresses. FIG. 1 conceptually illustrates a process 100 of some embodiments that creates such a connection-tracking record for a packet flow. The process 100 creates the connection-tracking record for an egressing packet flow when it receives the first packet of the flow. The packet includes a header with an internal source address (e.g., an address of a machine or device in the datacenter) and an external destination address (e.g., an address of a machine or device outside of the datacenter).

The process 100 receives (at 105) a packet of a packet flow. The packet includes a header with an internal source address and an external destination address. The internal source address is an address of a machine or device in the datacenter. The external destination address is an address of a machine or device outside of the datacenter such as an external website IP address.

From a pool of external source network addresses, the process 100 allocates (at 110) an external source network address for the packet flow. The allocated external source network address will serve as an address on the Internet (or other outside network) to which the destination machine can send reply packets.

The process 100 then uses (at 115) the allocated address to perform network address translation on the packet before forwarding the packet to its destination. The network address translation replaces the internal source address with the allocated external source network address. When the packet reaches its destination, the replaced source address provides the destination machine with a location to which to send reply packets over the Internet or other external network.

For a second packet flow that is received in response to the first packet flow, the process 100 creates (at 120) a connection-tracking record. The connection-tracking record maps the external destination address and the external source network address to the internal source address, in order to translate the destination addresses of packets in the second flow to the internal network address. This mapping allows packets sent in response to the first flow to be routed to the correct machine of the datacenter.

The mapping is used in the method of some embodiments by a network address translator (NAT) to implement an addressing system that determines the internal machine to which an incoming packet is addressed based on both the external source network address that the network uses to communicate with an external machine and the external destination address. This allows the NAT to route packets to the correct internal machine when the external source network address and port are receiving packets from an arbitrarily large number of external machines. That is, the same external source IP and source port address could receive packets from thousands or more different external machines and correctly route the packets to the machines at the appropriate internal network addresses. The NAT uses the connection-tracking records to identify the internal machine for each incoming packet flow.
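The reverse mapping created at 120 can be illustrated by the following minimal sketch, which shows one external source IP/port pair shared by two flows to different destinations. The class name ReplyTracker, its method names, and the example addresses (taken from documentation address ranges) are hypothetical and serve only as an illustration; a Python dictionary stands in for the connection-tracking storage.

```python
from typing import Dict, Tuple

Addr = Tuple[str, int]  # (IP address, port)


class ReplyTracker:
    """Maps (external destination, external source) to the internal source."""

    def __init__(self) -> None:
        self.records: Dict[Tuple[Addr, Addr], Addr] = {}

    def track(self, internal_src: Addr, external_src: Addr, external_dst: Addr) -> None:
        # Record created when the first packet of an egressing flow is seen.
        self.records[(external_dst, external_src)] = internal_src

    def translate_reply(self, reply_src: Addr, reply_dst: Addr) -> Addr:
        # A reply's source is the original external destination, and its
        # destination is the allocated external source address.
        return self.records[(reply_src, reply_dst)]


tracker = ReplyTracker()
shared_src = ("203.0.113.5", 40000)  # same external source IP and port for both flows

tracker.track(("10.0.0.1", 5555), shared_src, ("198.51.100.7", 443))
tracker.track(("10.0.0.2", 6666), shared_src, ("192.0.2.9", 443))

first = tracker.translate_reply(("198.51.100.7", 443), shared_src)
second = tracker.translate_reply(("192.0.2.9", 443), shared_src)
```

Because the external destination is part of the key, replies arriving at the same external source IP and port resolve to different internal machines.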

FIG. 2 illustrates a network address translation (NAT) system with a connection tracker. FIG. 2 includes connection tracker 210, network address translator (NAT) 220, and connection tracker storage 230.

The connection tracker storage 230 includes connection-tracking records used by the NAT 220 to determine which internal addresses to apply to incoming packets and which external source addresses to apply to outgoing packets. In some embodiments, the method creates a single connection-tracking record (1) for packets of a first, outgoing packet flow, and (2) for packets of a second, incoming packet flow that is received in response to the first packet flow.

In FIG. 2, each connection-tracking record in the connection tracker storage 230 includes multiple attributes. The NAT 220 uses these attributes as a set of match attributes and a set of action attributes. In the method of some embodiments, which attributes are used as the match attributes and which are used as action attributes depends on whether the packet to be matched is an incoming or an outgoing packet. For example, for an incoming packet, the match attributes for each connection-tracking record are the external destination IP (E.Dest.IP) and the external source network addresses (E.Src.IP and E.Src.port), while the action attributes are the internal source addresses (I.Src.IP and I.Src.port).

The network address translator 220 identifies the source and destination addresses for a header of an incoming packet received at the network and uses the connection tracker 210 to determine whether the source and destination addresses in the header match the (incoming) match attributes of a connection-tracking record in the connection tracker storage 230. The network address translator 220 then uses the (incoming) action attributes, I.Src.IP and I.Src.port, to replace the destination network address of the reply packet with the internal network address stored by the connection-tracking record. The reply packet is then forwarded to the machine at the internal network IP address and internal network port.

For outgoing packets, the match attributes are the external destination IP (E.Dest.IP) and the internal source addresses (I.Src.IP and I.Src.port), while the action attributes are the external source network addresses (E.Src.IP and E.Src.port). The NAT 220 identifies the source and destination addresses for a header of an outgoing packet to be sent from the network. The NAT 220 then uses the connection tracker 210 to determine whether the source and destination addresses in the header match the (outgoing) match attributes of a connection-tracking record in the connection tracker storage 230. The network address translator 220 then uses the (outgoing) action attributes, E.Src.IP and E.Src.port, to replace the source network address of the outgoing packet with the external source network address stored by the connection-tracking record. The outgoing packet is then forwarded to the external destination. Some embodiments use hash functions to quickly locate connection-tracking records in connection-tracking storages. The use of such hash functions is further described with respect to FIG. 10, below.
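The direction-dependent use of a single record's attributes, as described for FIG. 2, can be sketched as follows. This is an illustrative sketch only: the names NatRecord, Packet and translate are hypothetical, and a linear scan over a list is used for brevity in place of the hash-based lookup mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NatRecord:
    e_dest_ip: str
    e_src_ip: str
    e_src_port: int
    i_src_ip: str
    i_src_port: int


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def translate(pkt: Packet, records: List[NatRecord], outgoing: bool) -> Optional[Packet]:
    """Return the translated packet, or None if no record matches."""
    for r in records:
        if outgoing:
            # Match on E.Dest.IP plus I.Src.IP/I.Src.port; rewrite the source.
            if (pkt.dst_ip, pkt.src_ip, pkt.src_port) == (r.e_dest_ip, r.i_src_ip, r.i_src_port):
                return Packet(r.e_src_ip, r.e_src_port, pkt.dst_ip, pkt.dst_port)
        else:
            # Match on E.Dest.IP plus E.Src.IP/E.Src.port; rewrite the destination.
            if (pkt.src_ip, pkt.dst_ip, pkt.dst_port) == (r.e_dest_ip, r.e_src_ip, r.e_src_port):
                return Packet(pkt.src_ip, pkt.src_port, r.i_src_ip, r.i_src_port)
    return None


records = [NatRecord("198.51.100.7", "203.0.113.5", 40000, "10.0.0.1", 5555)]
out_pkt = translate(Packet("10.0.0.1", 5555, "198.51.100.7", 443), records, outgoing=True)
in_pkt = translate(Packet("198.51.100.7", 443, "203.0.113.5", 40000), records, outgoing=False)
```

The same record serves both directions: only the choice of which fields are compared and which are written into the packet changes.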

The above-described method creates only one mapping connection-tracking record, and then uses different portions of this record for the match and action attributes for flows in the forward and reverse directions. However, in some embodiments, the connection-tracking record includes two sub-records, a first sub-record for the forward direction (i.e., for the subsequent packets of the outgoing flow), and a second sub-record for the reverse direction (i.e., for the packets of the incoming packet flow that is in response to the outgoing packet flow). In such embodiments, the first sub-record maps the combination of the internal source network address and external destination address to the allocated external source network address while the second sub-record maps the combination of the external destination address and the external source network address to the internal source network address. In other embodiments two separate connection-tracking records similar to such sub-records are created, one record for outgoing packets of a flow and one for incoming packets of a response flow.

In some embodiments, in addition to the connection-tracking record including the external destination IP, the match attributes also include an external destination port. Such a system allows up to 64K connections for each external destination IP/external destination port pair rather than 64K connections for each external destination IP. FIG. 3 illustrates a network address translation system with a connection tracker 310 that includes external destination ports in its attributes. The network address translator 320 uses a connection tracker 310 to access connection tracker storage 330 in which each connection-tracking record includes, for incoming packets, (1) match attributes: external destination IP (E.Dest.IP), external destination port (E.Dest.port), and external source network addresses (E.Src.IP and E.Src.port), and (2) action attributes: internal source addresses (I.Src.IP and I.Src.port). For outgoing packets, the match attributes are: external destination IP (E.Dest.IP), external destination port (E.Dest.port), and internal source addresses (I.Src.IP and I.Src.port), while the action attributes are: external source network addresses (E.Src.IP and E.Src.port). Some embodiments that include destination port addresses in the connection-tracking records also create two sub-records or two separate connection-tracking records, rather than using a single connection-tracking record and determining which attributes are match or action attributes based on the direction of the packet flow.

In some embodiments, the connection-tracking record uses both the external destination address and external source network address for mapping to the internal network address because the same external source network address is used for different packet flows to different destination addresses. For instance, as mentioned above, the method of some embodiments uses the same external source IP address along with the same 64K source port range for multiple different destination IP addresses. The method of some embodiments further extends the sharing of the 64K source ports by using this port range not only for different destination IP addresses but also for different destination port addresses to the same destination IP address. This allows the method to support up to 4,294,967,296 (64K multiplied by 64K) connections for each destination IP address.
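The connection count stated above follows from simple arithmetic, shown here as a short check. The variable names are illustrative only, and the figures assume a single external source IP address and that all 64K destination ports of the destination IP address are in use.

```python
# 64K source ports are available on one external source IP address.
SOURCE_PORTS = 2 ** 16

# Keyed on destination IP only: the source port range is used once per destination IP.
connections_per_dest_ip = SOURCE_PORTS

# Keyed on destination IP and destination port: the source port range is
# re-used for each of the 64K destination ports of that destination IP.
DEST_PORTS = 2 ** 16
connections_per_dest_ip_with_ports = SOURCE_PORTS * DEST_PORTS
```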

Because of the potentially huge number of ports to be allocated for the multiple flows, it is useful to have a fast and efficient method of allocating and de-allocating ports of an external source network address (e.g., an external source IP address). For the sake of efficiency, rather than allocating the 64K ports of an external source IP in a random manner, some embodiments use a source-port allocation process that pre-allocates groups of source port addresses, assigns source port addresses to new connections from these pre-allocated port groups, and adds new and removes old pre-allocated port groups as the number of connections increases and decreases.

FIG. 4 conceptually illustrates such a source-port allocation process 400. The process 400 will be described by reference to an example shown in FIG. 5, which illustrates pre-allocation of groups of source port addresses and dynamic growth and reduction of these groups as the number of connections increases and decreases. FIG. 5 includes a column of port addresses 500 of an external source IP address, multiple pre-allocated port groups 510 at time T=1, multiple pre-allocated port groups 520 at time T=2, and multiple pre-allocated port groups 530 at time T=3. The available ports 500 include 65,536 (64K) ports numbered from 0 to 65,535.

The process 400 (of FIG. 4) specifies (at 405) multiple pre-allocated port groups of external port addresses. The port addresses in the pre-allocated port groups are each port addresses of the same external source network address (e.g., of an external source IP address) and are all assigned with respect to the same external destination IP address. In some embodiments, a particular port value is not available to be allocated to a connection unless that port is in a pre-allocated port group. In FIG. 5, at time T=1, the pre-allocated port groups include two groups 512 and 514. Each of these groups contains 512 pre-allocated ports from the set of possible port addresses 500. Within each pre-allocated port group, the source port addresses are contiguous. Group 512 includes ports 0-511 and group 514 includes ports 512-1023.

The process 400 (of FIG. 4) allocates/de-allocates (at 410) external source port addresses from the pre-allocated port groups. Ports are allocated (as needed) to connections between machines of the datacenter and machines at external destination addresses. As FIG. 4 shows, operation 410 is applied repeatedly as new connections are needed.

When the process 400 determines (at 415) that not all pre-allocated ports in the pre-allocated port groups have been used (e.g., some ports are still available in the pre-allocated port groups), the process 400 determines (at 425) whether a new connection needs a port to be allocated or an existing connection has terminated and the port for that connection needs to be de-allocated. When ports for new connections are being allocated more frequently than ports for old (terminated) connections are being de-allocated, the process 400 will eventually determine (at 415) that all available ports of the pre-allocated groups are in use (e.g., assigned to connections between machines of the datacenter and the external destination address). In that case, the process 400 pre-allocates (at 420) additional port groups, then returns to operation 425.

FIG. 5 illustrates an example of such dynamic modifications of multiple pre-allocated groups 510-530 from time T=1 to time T=3. In this example, 600 external source port addresses in the pre-allocated port groups 512 and 514 have been allocated by time T=1. The pre-allocated group 512 is shown completely shaded to indicate that all 512 ports in group 512 are allocated to various connections between the external source IP address and the particular destination IP address associated with the pre-allocated port groups. Group 514 is partially shaded, with the port number 600 shown as the lowest available port, indicating that group 514 still has available ports 600-1023 to be assigned to new connections.

Between time T=1 and time T=2 the process 400 (of FIG. 4) allocates (at 410) additional external source port addresses from the pre-allocated port group 514 (of FIG. 5) for new connections between the external source IP address and the destination IP address associated with the port groups. At first, each new allocated port address will be one of the available ports in the range 600-1023 from group 514. However, in the illustrated example, the process 400 (of FIG. 4) eventually allocates all the available ports in the pre-allocated port group 514. After that, the process 400 determines (at 415) that all pre-allocated ports in the existing port groups are in use. The process 400 then pre-allocates (at 420) an additional port group, group 522 of FIG. 5. Group 522 also has 512 ports, specifically ports 1024-1535. The process 400 (of FIG. 4) then continues to allocate ports (at 410) until all ports of group 522 (of FIG. 5) are used, then pre-allocates (at 420) additional port group 524, which also has 512 ports, specifically ports 1536-2047.
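The growth of the pre-allocated port groups in this example can be illustrated by the following minimal sketch. The names PortGroup and PortGroupSet, the use of Python sets to track free ports, and the policy of placing a new group at the lowest unused 512-port block are assumptions made for illustration only; they are not drawn from the figures.

```python
from typing import List, Set

GROUP_SIZE = 512       # ports per pre-allocated group, as in FIG. 5
TOTAL_PORTS = 65536    # 64K ports of one external source IP address


class PortGroup:
    def __init__(self, first_port: int) -> None:
        self.first_port = first_port
        self.free: Set[int] = set(range(first_port, first_port + GROUP_SIZE))


class PortGroupSet:
    """Pre-allocated port groups for one external destination IP address."""

    def __init__(self, initial_groups: int = 2) -> None:
        self.groups: List[PortGroup] = [
            PortGroup(i * GROUP_SIZE) for i in range(initial_groups)
        ]

    def _pre_allocate_group(self) -> PortGroup:
        # Assumed policy: take the lowest contiguous block not yet in a group.
        taken = {g.first_port for g in self.groups}
        for start in range(0, TOTAL_PORTS, GROUP_SIZE):
            if start not in taken:
                group = PortGroup(start)
                self.groups.append(group)
                self.groups.sort(key=lambda g: g.first_port)
                return group
        raise RuntimeError("all 64K source ports are already pre-allocated")

    def allocate(self) -> int:
        # Use an existing group if any has a free port; otherwise add a group.
        group = next((g for g in self.groups if g.free), None)
        if group is None:
            group = self._pre_allocate_group()
        port = min(group.free)  # lowest available port first
        group.free.remove(port)
        return port
```

Starting from two groups covering ports 0-1023, the first 1,024 allocations leave the number of groups unchanged; the next allocation finds no free port and causes a third group covering ports 1024-1535 to be added.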

The process 400 (of FIG. 4), when no connection is being allocated/de-allocated, determines (at 430) whether it is time to re-examine port group allocation. In some embodiments, this determination is made periodically (e.g., every 5 seconds, every 10 seconds, etc.). In other embodiments, a state of the system may trigger the re-examination (e.g., the process may re-examine group allocation when the existing pre-allocated ports are nearly all used and/or when all ports of a pre-allocated group have become available either at all or for a threshold amount of time). When the process 400 determines (at 430) that it is not time to re-examine port group allocation, the process 400 returns to operation 425.

When the process 400 determines (at 430) that it is time for a re-examination and the re-examination determines that the number of connections that need external source ports to connect to the external destination address has changed, the process 400 will dynamically modify (at 435) the number of pre-allocated groups, by either increasing or reducing the number of pre-allocated groups. For example, the process 400 (of FIG. 4) reduces (at 435) the number of groups by identifying a pre-allocated port group that is no longer active and removing it after de-allocating (at 410) all of the ports of that pre-allocated port group. Removing the group may mean deleting the group, setting the group into an idle state, or otherwise eliminating it from use. After modifying the number of groups, the process 400 then returns to operation 425 to determine whether connections need to be allocated/de-allocated.

In FIG. 5, sometime between the process adding group 524 and time T=2, connections using ports in group 522 began to terminate. When those connections terminated, the ports became available again and the process 400 de-allocated (at 410) the ports. The newly available ports are shown as the unshaded portion of group 522. In this example, sometime between time T=2 and time T=3, all of the connections in group 522 terminated and the process 400 (of FIG. 4) dynamically modified (at 435) the number of groups by removing (e.g., deleting) group 522 (of FIG. 5). In the illustrated example, some connections using ports tracked by groups 514 and 524 also terminated between T=2 and T=3, as shown by the unshaded portions of those groups. However, since not all connections with ports tracked by either of those groups terminated, the groups 514 and 524 were not removed by process 400.

Although, for simplicity of explanation, the description above describes ports as either only being allocated or only being de-allocated at various times, the process 400 can allocate and de-allocate ports as necessary. For example, the process 400 can allocate a particular port for a new connection after an old connection (to which that port was allocated) terminates. In some embodiments, the method preferentially assigns the lowest available ports to new connections, resulting in a tendency for higher pre-allocated port groups to empty out sooner than lower pre-allocated port groups as overall demand for connections to the destination IP address decreases.
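The allocation, growth, and removal behavior described above for FIGS. 4 and 5 can be sketched as follows. This is a minimal illustration rather than the implementation of any embodiment: the class name `PortGroups`, its methods, and the choice to start with no groups and to re-use the lowest missing port range when a new group is needed are all assumptions made for the example. Only the 512-port group size and the lowest-available-port preference come from the description.

```python
GROUP_SIZE = 512  # group size used in the FIG. 5 example


class PortGroups:
    """Hypothetical tracker for one (external source IP, destination IP) pair."""

    def __init__(self, group_size=GROUP_SIZE, max_ports=65536):
        self.group_size = group_size
        self.max_ports = max_ports
        self.groups = {}  # group index -> set of allocated ports in that group

    def allocate(self):
        """Return the lowest available port, adding a group if all are full."""
        for index in sorted(self.groups):
            used = self.groups[index]
            if len(used) < self.group_size:
                base = index * self.group_size
                port = next(p for p in range(base, base + self.group_size)
                            if p not in used)
                used.add(port)
                return port
        # Every existing group is full: pre-allocate one more group.
        # Assumption: the lowest port range without a group is used.
        limit = self.max_ports // self.group_size
        index = next((i for i in range(limit) if i not in self.groups), None)
        if index is None:
            raise RuntimeError("no external source ports left")
        port = index * self.group_size
        self.groups[index] = {port}
        return port

    def release(self, port):
        """De-allocate a port when its connection terminates."""
        self.groups[port // self.group_size].discard(port)

    def reexamine(self):
        """Re-examination step: remove groups with no allocated ports."""
        for index in [i for i, used in self.groups.items() if not used]:
            del self.groups[index]


groups = PortGroups()
first_600 = [groups.allocate() for _ in range(600)]  # fills ports 0-599
```

After the 600 allocations in the usage lines, the sketch holds two groups, matching the state shown for time T=1, and the next allocation would return port 600.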

Although the above-described figures show operations in a particular order, one of ordinary skill in the art will realize that other orders of operations are within the scope of the invention. For example, in the illustrated embodiment, a new port group is created once the last port of the existing groups is allocated, but in other embodiments, the new port group may be created when the number of available ports in the pre-allocated groups drops below a threshold, in anticipation of a need for more capacity. In still other embodiments, a new port group may be created only when a new connection needs a port and all available ports are already allocated.

The illustrated groups track port availability only for connections between one particular external source IP and one particular destination IP. Although packets may come in to or be forwarded from the ports of the external source IP from other destination IP addresses, those connections are not tracked by the illustrated pre-allocated port groups of FIG. 5, but rather by separate pre-allocated port groups (not shown). That is, in some embodiments, there is a separate set of pre-allocated port groups for the external source address associated with each destination IP address. Furthermore, in some embodiments, there is a separate set of pre-allocated port groups for the external source address associated with each destination IP address/destination port pair. Although the illustrated pre-allocated port groups are each the same size (contain the same number of ports), in other embodiments, some pre-allocated port groups may be different sizes than others.
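The effect of keeping a separate set of port groups per destination can be shown with a small sketch. The class `PerDestinationPorts`, its dictionary keyed by (external source IP, destination IP), and the example addresses are hypothetical; the sketch tracks allocated ports in a plain set per key instead of in pre-allocated groups, purely to show that the same external source port number can be in use for two different destinations at once.

```python
from collections import defaultdict


class PerDestinationPorts:
    """Hypothetical per-destination port tracker (illustration only)."""

    def __init__(self, max_ports=65536):
        self.max_ports = max_ports
        # (external source IP, destination IP) -> set of allocated ports
        self.used = defaultdict(set)

    def allocate(self, external_source_ip, destination_ip):
        used = self.used[(external_source_ip, destination_ip)]
        for port in range(self.max_ports):  # lowest available port first
            if port not in used:
                used.add(port)
                return port
        raise RuntimeError("all ports in use for this destination")


nat = PerDestinationPorts()
a = nat.allocate("203.0.113.1", "198.51.100.7")
b = nat.allocate("203.0.113.1", "198.51.100.8")
# a and b are the same port number, tracked under different destinations.
```

Using the (external source IP, destination IP, destination port) triple as the key instead would correspond to the embodiments that keep a separate set per destination IP address/destination port pair.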

The methods of some embodiments provide efficient ways to search the multiple pre-allocated port groups for available ports. In some methods, each pre-allocated port group is associated with a set of metadata. The set of metadata for each group may include an indicator of how many of the ports tracked by that group are available and may also identify a specific available port tracked by that group (e.g., the lowest available port number tracked by the group).

FIG. 6 conceptually illustrates a process 600 of some embodiments to identify and allocate available ports using metadata of pre-allocated port groups. The process 600 uses the metadata to determine whether the existing pre-allocated port groups contain any available ports by iteratively examining metadata for each pre-allocated port group. The process will be described with respect to FIG. 7. FIG. 7 conceptually illustrates iteratively searching metadata of pre-allocated port groups. FIG. 7 includes 3 pre-allocated port groups, specifically two pre-allocated port groups 705 and 715 with four ports each and a pre-allocated port group 725 with eight ports. The groups 705, 715, and 725 each have accompanying metadata 710, 720, and 730, respectively.

The process 600 (of FIG. 6) begins when a new connection requires a port allocation. The process 600 identifies (at 605) a new connection that requires an external source port. The new connection will be a connection between a particular external source IP address and a particular external destination IP address. The pre-allocated port groups to be examined are groups of ports for connections to that particular external destination address.

The process 600 examines (at 610) the metadata for an unexamined (so far in the port allocation process 600) pre-allocated port group. If the process 600 determines (at 615) that the metadata of the first group indicates that no ports are available, the process 600 determines (at 620) whether any other groups have not had their metadata examined and then cycles through operations 610, 615, and 620 until it identifies (at 615) an available port and proceeds to operation 630 (as described in the next paragraph) or determines (at 620) that the metadata of the last group indicates no available ports. If the metadata of every group indicates no available ports, the process 600 creates (at 625) a new pre-allocated port group and identifies a first port of the new pre-allocated port group as available before proceeding to operation 630.

In the example of FIG. 7, the process 600 starts with the first pre-allocated port group and proceeds sequentially until it finds an available port. The pre-allocated port group 705 has no available ports, as indicated by all four ports of the first pre-allocated port group being shaded. Accordingly, the metadata 710 for group 705 shows no ports available. Therefore, the process 600 (of FIG. 6) determines, at 615, that the metadata identifies no port available in the first group. The process 600 then determines (at 620) that the group was not the last pre-allocated port group, because groups 715 and 725 have not been examined.

The process 600 then examines (at 610) the metadata for the next pre-allocated port group and returns to operation 615 to determine the results. In the example of FIG. 7, the next group is pre-allocated port group 715. Pre-allocated port group 715 has two ports available, as indicated in the figure by two unshaded ports. The metadata 720 indicates that two ports are available in the group and that the next assignable port number is 5. Therefore, the process 600 (of FIG. 6) determines (at 615) that port 5 is available. The process 600 (of FIG. 6) then proceeds to operation 630 and allocates port 5 to the new connection.

The process 600 then updates (at 635) the pre-allocated port group to indicate that the port is now allocated and updates the metadata to indicate the remaining number of available ports in the group and the number of the next available port. The process 600 then ends. In FIG. 7, at time T=3, the newly allocated port of group 715 has been shaded to indicate that the port has been allocated and metadata 720 has been updated to indicate only one available port rather than two (as was the case at time T=2) and the next available port is identified in the metadata 720 as port number 6.

The metadata in the illustrated example includes the available port with the lowest numerical value as the next port for allocation. However, in other embodiments some other port of the port group may be designated as the next available port (e.g., the longest unused port, the highest numbered available port, etc.).
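The metadata-driven search of process 600 can be sketched as follows. The `PortGroup` dataclass, its field names, and the `allocate_port` function are hypothetical names chosen for this illustration, and the port numbering (group 705 covering ports 0-3, group 715 covering ports 4-7, group 725 covering ports 8-15) is an assumption that is merely consistent with port 5 being the next assignable port of group 715. The size of a newly created group is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PortGroup:
    first_port: int
    size: int
    allocated: set = field(default_factory=set)
    available: int = 0                 # metadata: count of free ports
    next_port: Optional[int] = None    # metadata: lowest free port

    def __post_init__(self):
        self.refresh_metadata()

    def refresh_metadata(self):
        free = [p for p in range(self.first_port, self.first_port + self.size)
                if p not in self.allocated]
        self.available = len(free)
        self.next_port = free[0] if free else None


def allocate_port(groups, new_group_size=4):
    """Scan group metadata in order and allocate the first free port."""
    for group in groups:               # operations 610-620: metadata only
        if group.available > 0:        # operation 615
            break
    else:                              # operation 625: no group has a port
        first = groups[-1].first_port + groups[-1].size if groups else 0
        group = PortGroup(first, new_group_size)
        groups.append(group)
    port = group.next_port             # operation 630
    group.allocated.add(port)
    group.refresh_metadata()           # operation 635
    return port


fig7 = [
    PortGroup(0, 4, {0, 1, 2, 3}),     # like group 705: no ports available
    PortGroup(4, 4, {4, 7}),           # like group 715: ports 5 and 6 free
    PortGroup(8, 8, {8, 9}),           # like group 725: contents assumed
]
port = allocate_port(fig7)             # returns 5, as in the FIG. 7 example
```

The loop reads only the `available` counter of each group, so a full group is skipped without inspecting its individual ports; the per-port scan happens only when the metadata of the chosen group is refreshed.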

As mentioned above, in some embodiments, the connection using an allocated port may terminate. When the connection terminates, the port is de-allocated. The method of some embodiments updates (1) the pre-allocated port group to which the de-allocated port belongs and (2) the metadata for that group to indicate that the port is available to be allocated to a new connection for a new outgoing flow (and its reply flow). Furthermore, when all the ports in a particular pre-allocated port group are de-allocated, the method of some embodiments removes the pre-allocated port group as well.

FIG. 8 illustrates a process 800 for de-allocating ports. The process 800 will be described with reference to FIG. 9. FIG. 9 illustrates modifications of pre-allocated port groups upon the de-allocation of ports. The figure includes pre-allocated groups 905, 915, and 925 with metadata sets 910, 920, and 930 respectively. The figure is shown at times T=1 to T=3 as ports of pre-allocated group 905 are de-allocated and at time T=4 after the process 800 has ended and a separate cleanup process of some embodiments has removed an empty (e.g., with no allocated ports) pre-allocated port group.

The process 800 (of FIG. 8) determines (at 805) that a connection using an external source IP and port to connect to a particular external destination IP address has terminated. In some embodiments, the termination of a connection may be determined by a process that calculates how much time has passed since a packet of an outgoing and/or incoming flow of that connection has been received and identifies the connection as terminated after a threshold time has passed. Additionally or alternatively, a connection may be explicitly terminated by a termination code received from one or both of the machines at the original internal source address and the external destination address.
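The two ways of detecting termination described above (an idle threshold and an explicit termination) can be sketched as follows. The class `IdleTracker`, its method names, and the use of an opaque connection key are assumptions for illustration; no particular threshold value is given in the description, so the timeout is a parameter.

```python
import time


class IdleTracker:
    """Hypothetical tracker of the last packet time per connection."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout   # threshold, in seconds
        self.last_seen = {}                # connection key -> timestamp

    def packet_seen(self, conn, now=None):
        """Record a packet of the outgoing or incoming flow."""
        self.last_seen[conn] = time.monotonic() if now is None else now

    def close(self, conn):
        """Explicit termination signalled by one of the endpoints."""
        self.last_seen.pop(conn, None)

    def expire(self, now=None):
        """Return and forget connections idle for at least the threshold."""
        now = time.monotonic() if now is None else now
        dead = [c for c, t in self.last_seen.items()
                if now - t >= self.idle_timeout]
        for conn in dead:
            del self.last_seen[conn]
        return dead
```

Each connection returned by `expire` (or passed to `close`) would then have its external source port de-allocated as described for operations 810 and 815.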

The process 800 then updates (at 810) the pre-allocated port group for that port in the set of pre-allocated port groups associated with that external destination to indicate that the port is available. Two examples of port-group updating are conceptually illustrated in FIG. 9. At time T=1, the first port of pre-allocated port group 905 is shown by its shading as being allocated to a connection. By time T=2, the first port has been de-allocated, as shown by the lack of shading in the first port of group 905 at time T=2. Similarly, from time T=2 to time T=3, the last remaining allocated port of group 905 is de-allocated, as shown by the removal of its shading.

The process 800 (of FIG. 8) then updates (at 815) the metadata for the pre-allocated port group. In FIG. 9, metadata 910 of group 905 is updated from showing 2 available ports in the associated pre-allocated port group 905 at time T=1 to showing 3 available ports at time T=2. In the illustrated embodiment, the metadata includes the first (lowest port numbered) available port as the next available port for allocation. The first available port identified in the metadata is updated from port 1 at time T=1 to port 0 (the de-allocated port) at time T=2. The metadata 910 is updated again at time T=3 to show 4 available ports in group 905. The process 800 (of FIG. 8) then ends.

In some embodiments, a separate clean-up process (e.g., as discussed with respect to operations 430 and 435 of FIG. 4) removes pre-allocated port groups with no allocated ports. In FIG. 9, group 905 has a remaining allocated port at time T=2, so the group is not removed by such a clean-up process at that time. However, after the last remaining allocated port of group 905 is de-allocated, the group 905 is removed by the clean-up process, leaving only groups 915 and 925 as active pre-allocated groups at time T=4.
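The de-allocation steps of process 800 and the separate clean-up pass can be sketched as follows. The dictionary layout, the function names, and the specific allocated ports are assumptions; the example is arranged so that the first group behaves like group 905 (ports 0-3, two ports available with port 1 as the first available port at T=1).

```python
def update_metadata(group):
    """Recompute the available count and the lowest free port."""
    free = [p for p in range(group["first"], group["first"] + group["size"])
            if p not in group["allocated"]]
    group["meta"] = {"available": len(free),
                     "next": free[0] if free else None}


def make_group(first_port, size, allocated):
    group = {"first": first_port, "size": size, "allocated": set(allocated)}
    update_metadata(group)
    return group


def deallocate(groups, port):
    """Mark the port free in its group, then refresh that group's metadata."""
    for group in groups:
        if group["first"] <= port < group["first"] + group["size"]:
            group["allocated"].discard(port)   # operation 810
            update_metadata(group)             # operation 815
            return
    raise KeyError(port)


def clean_up(groups):
    """Separate pass: keep only groups that still have an allocated port."""
    return [g for g in groups if g["meta"]["available"] < g["size"]]


groups = [
    make_group(0, 4, {0, 3}),   # like group 905 at T=1
    make_group(4, 4, {4, 5}),   # like group 915 (contents assumed)
    make_group(8, 4, {8}),      # like group 925 (contents assumed)
]
deallocate(groups, 0)           # T=2: 3 ports available, next port is 0
deallocate(groups, 3)           # T=3: all 4 ports available
groups = clean_up(groups)       # T=4: the empty group is removed
```

Here the clean-up pass relies on the metadata to recognize an empty group, which corresponds to the embodiments in which the metadata is updated before the group is removed.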

One of ordinary skill in the art will understand that in the illustrated embodiments of FIGS. 8 and 9 the process 800 updates the metadata 910 for the empty pre-allocated port group 905 before a separate clean-up process removes the group (e.g., using the metadata to identify the group as empty), while other embodiments may simply remove the group when it becomes empty, without updating the metadata first. Although the metadata in the illustrated example includes the available port with the lowest numerical value as the next port for allocation, in other embodiments some other port of the port group may be designated as the next available port (e.g., the longest unused port, the highest numbered available port, etc.).

In some embodiments, the pre-allocated port groups are stored as bitmap chunks. In such embodiments, each bit of a bitmap chunk represents a single port, with (i) the port number based on the position of the bit within the bitmap and (ii) the state of the port (allocated or not-allocated) determined by the value of the bit. In some such embodiments, as the number of connections to a destination IP grows, larger pre-allocated port groups are allocated (i.e., as larger bitmap chunks). For example, in some embodiments, the first two pre-allocated chunks are implemented as two 512-bit chunks, the next four pre-allocated chunks are implemented as four 1024-bit chunks, the next eight pre-allocated chunks are implemented as eight 2048-bit chunks, and the next eleven pre-allocated chunks are implemented as eleven 4096-bit chunks. In some embodiments, the maximum total number of bits in all the bitmap chunks for a given destination IP address (or destination IP address and port) is 64K. Other embodiments that implement pre-allocated port groups as bitmaps may use different sizes of bitmap chunks to reach 64K bits or may have other maximum numbers of bits (e.g., in order to limit the number of ports available for connections to a particular destination address).
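A bitmap chunk of this kind can be sketched with a Python integer used as the bit array. The class `BitmapChunk`, the convention that a set bit means "allocated", and the `chunk_sizes` helper are assumptions for illustration. The listed chunk counts sum nominally to 66,560 bits, so the helper clamps the final chunk to respect the 64K maximum; that clamping is an assumption of this sketch and not something the description specifies.

```python
class BitmapChunk:
    """Hypothetical bitmap chunk: bit i tracks port first_port + i."""

    def __init__(self, first_port, nbits):
        self.first_port = first_port
        self.nbits = nbits
        self.bits = 0                      # a set bit means "allocated"

    def allocate(self):
        """Allocate the lowest free port, or return None if the chunk is full."""
        free = ~self.bits & ((1 << self.nbits) - 1)
        if free == 0:
            return None
        i = (free & -free).bit_length() - 1   # index of the lowest set bit
        self.bits |= 1 << i
        return self.first_port + i

    def release(self, port):
        self.bits &= ~(1 << (port - self.first_port))


# (count, bits per chunk) from the example in the description.
CHUNK_SCHEDULE = [(2, 512), (4, 1024), (8, 2048), (11, 4096)]
MAX_PORTS = 64 * 1024


def chunk_sizes():
    """Yield chunk sizes in order, clamped so the total never exceeds 64K."""
    total = 0
    for count, nbits in CHUNK_SCHEDULE:
        for _ in range(count):
            size = min(nbits, MAX_PORTS - total)
            if size <= 0:
                return
            yield size
            total += size


chunk = BitmapChunk(first_port=1024, nbits=512)
port = chunk.allocate()                    # lowest free port in the chunk
```

Finding the lowest free port reduces to isolating the lowest set bit of the inverted bitmap, which is why a bitmap representation pairs naturally with the lowest-available-port preference described earlier.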

As mentioned above, the method of some embodiments uses hash functions to quickly locate connection-tracking records. FIG. 10 illustrates the use of a hash function to locate tracking-records. The figure includes a set of packet header data 1005, a hash function 1010, and hash buckets 1015 (e.g., rows of a hash table) with tracking record entries 1017. Some of the entries are stored in linked lists 1020 and 1025. One of ordinary skill in the art will understand that although both the data to which the hash function is applied in this figure, and the match and action attributes described with respect to FIGS. 2 and 3, are packet header data, some or all of the packet header data on which the hash is performed may not be the same as any of the match or action attributes.

In the illustrated embodiment of FIG. 10, when a packet comes into the network, the method identifies the external destination IP (EDI) and external destination port (EDP) values from the packet header data 1005. A hash function 1010 is then applied to the EDI/EDP pairs to identify a particular bucket 1015 of a hash table in which the connection-tracking records 1017 are stored.

In some embodiments, hash collisions occur for some flows, such as when multiple connection-tracking records 1017 have the same EDI/EDP pair or when connection-tracking records 1017 of flows with different EDI/EDP pairs hash to the same location. In the event of a hash collision, the connection-tracking record entries 1017 are stored in linked lists, such as linked list 1020, which includes three connection-tracking records 1017 for flows with the EDI2/EDP3 pair, and linked list 1025, which contains one connection-tracking record 1017 for flows with the EDI1/EDP2 pair and two connection-tracking records 1017 for flows with the EDI2/EDP4 pair.
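The bucket lookup and collision handling described for FIG. 10 can be sketched as follows. The record fields, the bucket count, the use of CRC-32 as the hash function, and the use of a Python list in place of a linked list are all assumptions made for this illustration; the description does not specify any of them.

```python
import zlib
from collections import namedtuple

# Hypothetical record layout: the EDI/EDP pair plus the port mapping.
Record = namedtuple(
    "Record", "edi edp external_port internal_ip internal_port")

NUM_BUCKETS = 8  # assumed table size


def bucket_index(edi, edp):
    """Hash the EDI/EDP pair to a bucket (CRC-32 is a stand-in hash)."""
    return zlib.crc32(f"{edi}:{edp}".encode()) % NUM_BUCKETS


class ConnTable:
    def __init__(self):
        # Each bucket holds a chain of records (a list stands in for
        # the linked lists 1020 and 1025 of FIG. 10).
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, rec):
        self.buckets[bucket_index(rec.edi, rec.edp)].append(rec)

    def lookup(self, edi, edp, external_port):
        """Walk the chain for the EDI/EDP bucket to find the exact record."""
        for rec in self.buckets[bucket_index(edi, edp)]:
            if (rec.edi, rec.edp, rec.external_port) == (
                    edi, edp, external_port):
                return rec
        return None


table = ConnTable()
table.insert(Record("198.51.100.7", 443, 5, "10.0.0.4", 40001))
table.insert(Record("198.51.100.7", 443, 6, "10.0.0.9", 40002))
hit = table.lookup("198.51.100.7", 443, 6)
```

Both records in the usage lines share an EDI/EDP pair and therefore land in the same bucket, so the lookup walks the chain and compares the external source port to pick out the right record.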

One of ordinary skill in the art will understand that the methods of some embodiments may use other packet header values as inputs for hash functions. For example, the methods of some embodiments may use the entire connection tuple of a packet or any chosen subset of its values as inputs for a hash function. Some embodiments may use multiple instances of one or more hash functions to populate multiple hash tables with the same general purpose. For example, some embodiments may provide a separate hash table for connection-tracking records for each external source IP address. Similarly, in some embodiments, the methods may use multiple hash functions for different purposes. For example, some embodiments use a hash function to sort entries of a hash table containing connection-tracking records and another hash function to identify where pre-allocation port groups are stored for each ESI/EDI pair or ESI/EDI/EDP combination. Additionally, although the illustrated embodiment of FIG. 10 shows connection-tracking records 1017 in the buckets 1015, in some embodiments, the hash function identifies the location in the hash table of a pointer that in turn identifies a first entry in a linked list of connection-tracking records, the location of another hash table, the location of a binary tree, etc.

FIG. 11 conceptually illustrates an electronic system 1100 with which some embodiments of the invention are implemented. The electronic system 1100 can be used to execute any of the control, virtualization, or operating system applications described above. The electronic system 1100 may be a computer (e.g., a desktop computer, personal computer, tablet computer, server computer, mainframe, a blade computer, etc.), or any other sort of electronic device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. Electronic system 1100 includes a bus 1105, processing unit(s) 1110, a system memory 1125, a read-only memory 1130, a permanent storage device 1135, input devices 1140, and output devices 1145.

The bus 1105 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 1100. For instance, the bus 1105 communicatively connects the processing unit(s) 1110 with the read-only memory 1130, the system memory 1125, and the permanent storage device 1135.

From these various memory units, the processing unit(s) 1110 retrieve instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.

The read-only-memory (ROM) 1130 stores static data and instructions that are needed by the processing unit(s) 1110 and other modules of the electronic system. The permanent storage device 1135, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 1100 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 1135.

Other embodiments use a removable storage device (such as a floppy disk, flash drive, etc.) as the permanent storage device. Like the permanent storage device 1135, the system memory 1125 is a read-and-write memory device. However, unlike storage device 1135, the system memory is a volatile read-and-write memory, such as a random access memory. The system memory 1125 stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 1125, the permanent storage device 1135, and/or the read-only memory 1130. From these various memory units, the processing unit(s) 1110 retrieve instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 1105 also connects to the input and output devices 1140 and 1145. The input devices 1140 enable the user to communicate information and select commands to the electronic system. The input devices 1140 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 1145 display images generated by the electronic system 1100. The output devices 1145 include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 11, bus 1105 also couples electronic system 1100 to a network 1165 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 1100 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra-density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself.

As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms display or displaying mean displaying on an electronic device. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.

This specification refers throughout to computational and network environments that include virtual machines (VMs). However, virtual machines are merely one example of data compute nodes (DCNs) or data compute end nodes, also referred to as addressable nodes. DCNs may include non-virtualized physical hosts, virtual machines, containers that run on top of a host operating system without the need for a hypervisor or separate operating system, and hypervisor kernel network interface modules.

VMs, in some embodiments, operate with their own guest operating systems on a host using resources of the host virtualized by virtualization software (e.g., a hypervisor, virtual machine monitor, etc.). The tenant (i.e., the owner of the VM) can choose which applications to operate on top of the guest operating system. Some containers, on the other hand, are constructs that run on top of a host operating system without the need for a hypervisor or separate guest operating system. In some embodiments, the host operating system uses name spaces to isolate the containers from each other and therefore provides operating-system level segregation of the different groups of applications that operate within different containers. This segregation is akin to the VM segregation that is offered in hypervisor-virtualized environments that virtualize system hardware, and thus can be viewed as a form of virtualization that isolates different groups of applications that operate in different containers. Such containers are more lightweight than VMs.

Hypervisor kernel network interface modules, in some embodiments, are non-VM DCNs that include a network stack with a hypervisor kernel network interface and receive/transmit threads. One example of a hypervisor kernel network interface module is the vmknic module that is part of the ESXi™ hypervisor of VMware, Inc.

It should be understood that while the specification refers to VMs, the examples given could be any type of DCNs, including physical hosts, VMs, non-VM containers, and hypervisor kernel network interface modules. In fact, the example networks could include combinations of different types of DCNs in some embodiments.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

1-24. (canceled)
25. A method of allocating external source port addresses for a plurality of connections that share a limited set of external source IP addresses for connections to a destination IP address outside of a network, the method comprising: specifying a plurality of pre-allocated port groups with each group comprising a plurality of external source port addresses; allocating, for new connections to the destination IP address, external source port addresses from the pre-allocated groups when external source port addresses are available in the pre-allocated groups; and dynamically modifying a number of the pre-allocated groups as a number of connections increases or decreases to destinations outside of the network.
26. The method of claim 25, wherein said specifying, allocating and dynamically modifying provide an efficient mechanism for (i) tracking source port addresses assigned to connections to the destination IP address, and (ii) allocating new source port addresses when no previously pre-allocated source port addresses are available.
27. The method of claim 25, wherein each pre-allocated group includes a plurality of source port addresses.
28. The method of claim 27, wherein the source port addresses in each pre-allocated group are contiguous addresses in a range.
29. The method of claim 25, wherein the plurality of pre-allocated groups is a first set of pre-allocated groups, and dynamically modifying the number comprises identifying a new connection for which an external source port has to be assigned; determining that the first set of pre-allocated groups does not have an external source port address available to assign to the new connection; and specifying a second set of pre-allocated groups of external port addresses and allocating an external port address from the second set of pre-allocated groups.
30. The method of claim 25 further comprising: determining that a pre-allocated group does not have any available port for allocation to a new packet flow; and selecting another pre-allocated group from which a port should be selected for the new packet flow.
31. The method of claim 25 further comprising: defining different sets of pluralities of pre-allocated port groups, each set associated with a different external destination IP address; identifying, for a new packet flow, the port-group set associated with an external destination IP address stored in a header field of the new flow; and allocating, for the new packet flow, an external port address from a particular pre-allocated port group in the identified port-group set.
32. The method of claim 31 further comprising: defining, for each external destination IP address, a connection-tracking data store for storing connection-tracking records that map allocated external source port addresses to internal source IP and port addresses within the network, said connection-tracking records for use in performing network address translation on packets of flows exiting the network and performing destination address translation on packets of flows entering the network.
33. The method of claim 25 further comprising: allocating a bitmap of available source ports; allocating contiguous blocks of source ports in the bitmap to different pre-allocated port groups; and using the bitmap to identify the pre-allocated port groups and adjust the number of pre-allocated port groups.
34. A non-transitory machine readable medium storing a program which when executed by one or more processing units allocates external source port addresses for a plurality of connections that share a limited set of external source IP addresses for connections to a destination IP address outside of a network, the program comprising sets of instructions for: specifying a plurality of pre-allocated port groups with each group comprising a plurality of external source port addresses; allocating, for new connections to the destination IP address, external source port addresses from the pre-allocated groups when external source port addresses are available in the pre-allocated groups; and dynamically modifying a number of the pre-allocated groups as a number of connections increases or decreases to destinations outside of the network.
35. The non-transitory machine readable medium of claim 34, wherein said specifying, allocating and dynamically modifying provide an efficient mechanism for (i) tracking source port addresses assigned to connections to the destination IP address, and (ii) allocating new source port addresses when no previously pre-allocated source port addresses are available.
36. The non-transitory machine readable medium of claim 34, wherein each pre-allocated group includes a plurality of source port addresses.
37. The non-transitory machine readable medium of claim 36, wherein the source port addresses in each pre-allocated group are contiguous addresses in a range.
38. The non-transitory machine readable medium of claim 34, wherein the plurality of pre-allocated groups is a first set of pre-allocated groups, and dynamically modifying the number comprises identifying a new connection for which an external source port has to be assigned; determining that the first set of pre-allocated groups does not have an external source port address available to assign to the new connection; and specifying a second set of pre-allocated groups of external port addresses and allocating an external port address from the second set of pre-allocated groups.
39. The non-transitory machine readable medium of claim 34, wherein the program further comprises sets of instructions for: defining different sets of pluralities of pre-allocated port groups, each set associated with a different external destination IP address; identifying, for a new packet flow, the port-group set associated with an external destination IP address stored in a header field of the new flow; and allocating, for the new packet flow, an external port address from a particular pre-allocated port group in the identified port-group set.
40. The non-transitory machine readable medium of claim 39, wherein the program further comprises sets of instructions for: defining, for each external destination IP address, a connection-tracking data store for storing connection-tracking records that map allocated external source port addresses to internal source IP and port addresses within the network, said connection-tracking records for use in performing network address translation on packets of flows exiting the network and performing destination address translation on packets of flows entering the network.
41. The non-transitory machine readable medium of claim 34, wherein the program further comprises sets of instructions for: allocating a bitmap of available source ports; allocating contiguous blocks of source ports in the bitmap to different pre-allocated port groups; and using the bitmap to identify the pre-allocated port groups and adjust the number of pre-allocated port groups.